Oladotun Oladimejij
Over the past two decades, the rise of the internet has brought with it a rise in online dating. Even though online dating is very popular, in conversations with friends or on social media you will probably hear people say that online dating is superficial and does not really result in long-lasting relationships.
Throughout this tutorial I would like to analyze how the relationships of couples who met online measure up to those of couples who did not meet online, by analyzing features like relationship quality, age, the year they met, and other demographics. I also want to see if we will be able to learn how long a relationship will last based on how the couple met, as well as the other features mentioned earlier.
According to The Virtues and Downsides of Online Dating, 30% of U.S. adults say they have used a dating site or app. Of those who use online dating as a means to meet people, about six in ten say their experience was positive: they were able to find people they were attracted to and people who shared their interests. So, at least in the short term, online dating might not be so bad.
That being said, I want to examine how the duration of a relationship that starts online compares with relationships started in more traditional ways.
import warnings
# turned off warnings to make markdown prettier.
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from sklearn import svm, metrics
from sklearn.svm import LinearSVR, SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
sns.set_theme(style="ticks", color_codes=True)
This data is a collection of survey responses collected by Stanford in 2017, building on data they collected in 2009. Respondents describe how they met their partner, their relationship, and if and why they stayed together, with features like relationship_quality, the year they met, and more. The dataset has 3,510 survey respondents and 285 columns.
Dataset used: https://data.stanford.edu/hcmst2017#download-data
The first thing we are going to do is use pandas to read in the .dta file and save it as a dataframe, then take a peek at the head. There are over 200 columns, most of which we will not use, but I found it easier not to drop the unneeded columns and to instead pull just the needed columns into one dataframe later. For now, let's leave it as it is.
data = pd.io.stata.read_stata('HCMST 2017 fresh sample for public sharing draft v1.1.dta')
data.head()
| CaseID | CASEID_NEW | qflag | weight1 | weight1_freqwt | weight2 | weight1a | weight1a_freqwt | weight_combo | weight_combo_freqwt | ... | hcm2017q24_met_through_family | hcm2017q24_met_through_friend | hcm2017q24_met_through_as_nghbrs | hcm2017q24_met_as_through_cowork | w6_subject_race | interracial_5cat | partner_mother_yrsed | subject_mother_yrsed | partner_yrsed | subject_yrsed | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2014039 | Qualified | NaN | NaN | 0.8945 | NaN | NaN | 0.277188 | 19240.0 | ... | no | no | no | no | White | no | 12.0 | 14.0 | 12.0 | 14.0 |
| 1 | 3 | 2019003 | Qualified | 0.9078 | 71115.0 | NaN | 0.9026 | 70707.0 | 1.020621 | 70841.0 | ... | no | no | no | yes | White | no | 12.0 | 16.0 | 17.0 | 17.0 |
| 2 | 5 | 2145527 | Qualified | 0.7205 | 56442.0 | NaN | 0.7164 | 56121.0 | 0.810074 | 56227.0 | ... | no | no | no | no | White | no | 9.0 | 7.5 | 14.0 | 17.0 |
| 3 | 6 | 2648857 | Qualified | 1.2597 | 98682.0 | 1.3507 | 1.2524 | 98110.0 | 0.418556 | 29052.0 | ... | no | no | no | no | White | no | 16.0 | 12.0 | 12.0 | 12.0 |
| 4 | 7 | 2623465 | Qualified | 0.8686 | 68044.0 | NaN | 0.8636 | 67652.0 | 0.976522 | 67781.0 | ... | no | no | yes | no | White | no | 14.0 | 17.0 | 16.0 | 16.0 |
5 rows × 285 columns
There are a lot of columns named after the number of the corresponding question in the survey, so I renamed them to names that are more intuitive. I grouped the renamed columns into two dicts, met and others, because I want to iterate over met.values() for data visualization and grouping in later steps.
met = {"hcm2017q24_met_online":"met_online", "hcm2017q24_met_through_family":"met_family",
"hcm2017q24_met_through_friend":"met_friends", "hcm2017q24_church":"met_church",
"hcm2017q24_met_as_through_cowork":"met_coworkers", "hcm2017q24_met_through_as_nghbrs":"met_neighbors",
"hcm2017q24_mil":"met_military", "hcm2017q24_customer": "met_customer_client",
"hcm2017q24_party": "met_party", "hcm2017q24_school": "met_primary_secondary_school",
"hcm2017q24_vacation": "met_vacation", "hcm2017q24_business_trip": "met_buisness_trip",
"hcm2017q24_bar_restaurant": "met_bar_or_resturant", "hcm2017q24_public":"met_in_public",
"hcm2017q24_vol_org":"met_volunteer_organization", "hcm2017q24_college":"met_college",
"hcm2017q24_blind_date":"met_blind_date", "hcm2017q24_single_serve_nonint":"met_in_person_dating_service"
}
others = {"w6_q21a_year":"year_met",
"w6_q21a_month":"month_met",
"w6_relationship_quality":"relationship_quality",
"w6_q21b_year": "year_relationship",
"w6_q21b_month": "month_relationship",
"relate_duration_at_w6_years": "relationship_duration",
"w6_q21c_year":"year_lived_together",
"w6_q21c_month": "month_lived_together",
"Past_Partner_Q1":"got_married",
"w6_q15a4_truncated":"country_met",
"time_from_met_to_rel":"time_from_met_to_relationship",
"w6_relationship_end_mar":"how_marriage_end",
"w6_relationship_end_nonmar":"how_non_marriage_end",
"ppgender":"gender_of_respondent"}
data = data.rename(columns=met, errors="raise")
data = data.rename(columns=others, errors="raise")
ways_met = [*met.values()]
I need to convert the qualitative responses into quantitative ones so I can create inputs that can be represented as vectors in the machine learning portions later.
The first thing I will encode is how the couple met. The data already separates how the couples met into separate columns where each entry is either "yes" or "no", but I will code yes = 1 and no = 0.
The function coded_values() takes a dataframe and one of the columns indicating how the couple met, and builds an array with 1 where the value in that row is 'yes', 0 where it is 'no', and NaN for all missing or null values. The function add_coded_values() takes a dataframe along with the list ways_met and recodes every column in ways_met this way. The ways_met list comes from the values of the met dict defined earlier.
def coded_values(df, way):
    tmp = []
    for idx, row in df.iterrows():
        if df.at[idx, way] == 'yes':
            tmp.append(1)
        elif df.at[idx, way] == 'no':
            tmp.append(0)
        else:
            tmp.append(np.nan)
    df[way] = tmp
    return df

def add_coded_values(df, ways_met):
    for way in ways_met:
        coded_values(df, way)
add_coded_values(data,ways_met)
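The same yes/no recoding can also be sketched without an explicit loop: pandas' Series.map leaves any value outside the mapping as NaN, matching the behavior above. A minimal sketch on a toy column (the frame and column name here are made up for illustration):

```python
import pandas as pd

# toy frame standing in for one of the survey's how-they-met columns
df = pd.DataFrame({"how_met": ["yes", "no", None, "refused"]})

# map 'yes' -> 1, 'no' -> 0; anything unmapped becomes NaN
df["how_met"] = df["how_met"].map({"yes": 1, "no": 0})
```

This gives the same coded column in one vectorized step, which matters on a frame with thousands of rows and many columns.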
I will code the relationship quality in two ways. The first is a scale from 0 to 4, where 0 represents a very poor relationship and 4 represents an excellent one, with missing values as NaN.
quality = []
for idx, row in data.iterrows():
    q = str(data.at[idx, "relationship_quality"])
    if q == "very poor":
        x = 0
    elif q == "poor":
        x = 1
    elif q == "fair":
        x = 2
    elif q == "good":
        x = 3
    elif q == "excellent":
        x = 4
    else:
        x = float("nan")
    quality.append(x)
data["relationship_quality_coded"] = quality
I also encode the relationship quality as a one-hot vector, to be used as an input for machine learning prediction in later steps. Each indicator is set to 1 if the relationship quality is that level, for example "fair", and 0 if not.
quality_very_poor = []
quality_poor = []
quality_fair = []
quality_good = []
quality_excellent = []
for idx, row in data.iterrows():
    q = str(data.at[idx, "relationship_quality"])
    if q == "very poor":
        quality_very_poor.append(1)
        quality_poor.append(0)
        quality_fair.append(0)
        quality_good.append(0)
        quality_excellent.append(0)
    elif q == "poor":
        quality_very_poor.append(0)
        quality_poor.append(1)
        quality_fair.append(0)
        quality_good.append(0)
        quality_excellent.append(0)
    elif q == "fair":
        quality_very_poor.append(0)
        quality_poor.append(0)
        quality_fair.append(1)
        quality_good.append(0)
        quality_excellent.append(0)
    elif q == "good":
        quality_very_poor.append(0)
        quality_poor.append(0)
        quality_fair.append(0)
        quality_good.append(1)
        quality_excellent.append(0)
    elif q == "excellent":
        quality_very_poor.append(0)
        quality_poor.append(0)
        quality_fair.append(0)
        quality_good.append(0)
        quality_excellent.append(1)
    else:
        quality_very_poor.append(float("nan"))
        quality_poor.append(float("nan"))
        quality_fair.append(float("nan"))
        quality_good.append(float("nan"))
        quality_excellent.append(float("nan"))
data["relationship_quality_very_poor"] = quality_very_poor
data["relationship_quality_poor"] = quality_poor
data["relationship_quality_fair"] = quality_fair
data["relationship_quality_good"] = quality_good
data["relationship_quality_excellent"] = quality_excellent
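An equivalent one-hot encoding can be sketched with pandas.get_dummies, which builds one indicator column per category automatically. A minimal sketch on a toy column (values made up for illustration; truly missing values would still need the NaN handling from the loop above):

```python
import pandas as pd

# toy column standing in for relationship_quality
s = pd.Series(["excellent", "fair", "good", "excellent"])

# one 0/1 indicator column per quality level
dummies = pd.get_dummies(s, prefix="relationship_quality", dtype=int)
```

The resulting columns can then be joined back onto the main dataframe, avoiding the five parallel lists.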
Below I will convert some columns to numeric values for data analysis and the creation of new columns.
data['age_when_met'] = data['age_when_met'].astype('float32')
data['relationship_duration'] = data['relationship_duration'].astype('float32')
data['country_met'] = data['country_met'].astype('str')
data["year_met"] = pd.to_numeric(data["year_met"],errors='coerce').fillna(data["year_met"])
data["ppage"] = pd.to_numeric(data["ppage"],errors='coerce').fillna(data["ppage"])
data["w6_q9"] = pd.to_numeric(data["w6_q9"],errors='coerce').fillna(data["w6_q9"])
data["ppage"].replace({-1: float("nan")}, inplace=True)
data["w6_q9"].replace({-1: float("nan")}, inplace=True)
data["age_diff"] = abs(data["ppage"] - data["w6_q9"])
Above I created a column with the age difference between partners, because I want to compare relationship quality and duration against the age difference.
Below I am adding a column that records the way the couple met, so I will be able to group the dataframe by the way they met later.
way_met_column = []
for idx, row in data.iterrows():
    found = False
    for way in ways_met:
        if data.at[idx, way] == 1:
            found = True
            way_met_column.append(way)
            break
    if not found:
        way_met_column.append(np.nan)
data['ways_met'] = way_met_column
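Assuming the ways_met columns are now 0/1 indicators, the same first-match lookup can be sketched with DataFrame.idxmax, which returns the label of the first maximal column in each row; rows where no way is 1 are masked back to NaN, like the loop above does. A minimal sketch using two of the indicator columns on toy data:

```python
import pandas as pd

# toy 0/1 indicators standing in for the recoded ways_met columns
df = pd.DataFrame({
    "met_online":  [1, 0, 0],
    "met_friends": [0, 1, 0],
})

cols = ["met_online", "met_friends"]
any_way = df[cols].sum(axis=1) > 0                 # did any way match at all?
df["ways_met"] = df[cols].idxmax(axis=1).where(any_way)
```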
After initial tidying this is how the dataframe looks.
data.head()
| CaseID | CASEID_NEW | qflag | weight1 | weight1_freqwt | weight2 | weight1a | weight1a_freqwt | weight_combo | weight_combo_freqwt | ... | partner_yrsed | subject_yrsed | relationship_quality_coded | relationship_quality_very_poor | relationship_quality_poor | relationship_quality_fair | relationship_quality_good | relationship_quality_excellent | age_diff | ways_met | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2014039 | Qualified | NaN | NaN | 0.8945 | NaN | NaN | 0.277188 | 19240.0 | ... | 12.0 | 14.0 | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | met_online |
| 1 | 3 | 2019003 | Qualified | 0.9078 | 71115.0 | NaN | 0.9026 | 70707.0 | 1.020621 | 70841.0 | ... | 17.0 | 17.0 | 4.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | met_coworkers |
| 2 | 5 | 2145527 | Qualified | 0.7205 | 56442.0 | NaN | 0.7164 | 56121.0 | 0.810074 | 56227.0 | ... | 14.0 | 17.0 | 3.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | met_online |
| 3 | 6 | 2648857 | Qualified | 1.2597 | 98682.0 | 1.3507 | 1.2524 | 98110.0 | 0.418556 | 29052.0 | ... | 12.0 | 12.0 | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | met_online |
| 4 | 7 | 2623465 | Qualified | 0.8686 | 68044.0 | NaN | 0.8636 | 67652.0 | 0.976522 | 67781.0 | ... | 16.0 | 16.0 | 4.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | met_neighbors |
5 rows × 293 columns
I will be using these functions below to plot the data.
met_freq() and met_freq2() count how many couples met in a specific way. This will be useful for the bar graphs make_graphs() generates later. met_freq() counts frequencies of numeric (0/1) values and met_freq2() counts string ('yes'/'no') values.
def met_freq(met_dict, df, way):
    met_dict[way] = 0
    for idx, row in df.iterrows():
        if df.at[idx, way] == 1:
            met_dict[way] += 1

def met_freq2(met_dict, df, way):
    met_dict[way] = 0
    for idx, row in df.iterrows():
        if df.at[idx, way] == 'yes':
            met_dict[way] += 1
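Both counters can also be sketched vectorized: comparing the column to the target value and summing the boolean result gives the same count without iterrows. A minimal sketch on toy data (frame and values made up for illustration):

```python
import pandas as pd

# toy frame: one numeric 0/1 column and one raw yes/no column
df = pd.DataFrame({"met_online": [1, 0, 1, 1],
                   "hcm2017q24_internet_dating": ["yes", "no", "yes", "no"]})

# met_freq equivalent: count rows equal to 1
numeric_count = int((df["met_online"] == 1).sum())

# met_freq2 equivalent: count rows equal to 'yes'
string_count = int((df["hcm2017q24_internet_dating"] == "yes").sum())
```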
The function make_graphs() separates the data by decade and shows the most popular ways couples met each other in each decade.
def make_graphs(times, ways_met, titles):
    i = 0
    for timedf in times:
        met_counter = {}
        for way in ways_met:
            met_freq(met_counter, timedf, way)
        ways = [*met_counter.keys()]
        freq = [*met_counter.values()]
        met_freq_df = pd.DataFrame({"W": ways, "A": freq})
        plt.figure(figsize=(8, 8))
        # make barplot and sort bars
        sns.barplot(x='A',
                    y="W",
                    data=met_freq_df,
                    orient='h',
                    order=met_freq_df.sort_values('A', ascending=False).W)
        # set labels
        plt.xlabel("Amount of People")
        plt.ylabel("Ways People met their Significant Other")
        plt.title("Ways People met their Significant Other from " + titles[i])
        plt.tight_layout()
        i += 1
The regression_plot() function takes in a dataframe, the x and y column names, a title, the corresponding axis labels, and finally whether the data points should be annotated. It draws a regression plot with a line of best fit.
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = [13,6]
sns.set(color_codes=True)
def regression_plot(df, x, y, title, xlabel, ylabel, should_annotate):
    plt.figure(figsize=(10, 6))
    fig, ax = plt.subplots()
    sns.regplot(df[x], df[y], dropna=True)
    plt.title(title)
    plt.ylabel(ylabel)
    plt.xlabel(xlabel)
    i = 0
    for way in df['ways_met']:
        # highlight met_online annotations in red
        c = "red" if way == "met_online" else "black"
        if should_annotate:
            ax.annotate(way, (df[x][i], df[y][i]), color=c)
        i += 1
Before we start looking at specific ways couples met, I would like to see if we can notice trends in the data overall. But first, a word about the column 'relationship_duration': relationship duration was calculated as the date the relationship ended minus the date the couple first started their relationship. If a couple was still together as of July 2017, that date is treated as the "end" for analysis purposes. So, to be clear, relationship duration does not only apply to couples who have broken up.
regression_plot(data,"year_met",
"relationship_duration",
"Year Couple Met vs Relationship Duration",
"Year Couple Met","Relationship Duration (Years)",False)
This is a clear linear relationship, which makes sense: the more recently a couple met, the shorter their relationship can possibly be.
regression_plot(data,"age_diff","relationship_duration",
"Age Difference between Couples vs Relationship Duration in years ",
"Age Difference between Couples (Years)","Relationship Duration (Years)",False)
sns.scatterplot(data=data, x="age_diff", y="relationship_duration", hue="ways_met")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
The graph above plots the age difference between partners against relationship duration, both in years. There seems to be a strong correlation: the smaller the age difference, the more likely the couple is to stay together.
regression_plot(data,"age_when_met","relationship_duration",
"Age Respondent First Met Partner vs Relationship Duration",
"Age Respondent First Met Partner(Years)",
"Relationship Duration",False)
sns.scatterplot(data=data, x="age_when_met", y="relationship_duration", hue="ways_met")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
The graph above plots the age the respondent first met their partner against relationship duration, both in years. There seems to be a strong correlation: the younger the couple was when they met, the longer their relationship tends to last.
regression_plot(data,"time_from_met_to_relationship",
"relationship_duration",
"Time Known Before Relationship Started vs Relationship Duration",
"Length of Time From Meet Time (Years)","Relationship Duration (Years)",False)
sns.scatterplot(data=data, x="time_from_met_to_relationship", y="relationship_duration", hue="ways_met")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
The graph above plots the time a couple knew each other before their relationship started against relationship duration, both in years. Though the regression line is not very steep, it appears that the longer it takes for a couple to get into a relationship, the shorter their relationship might be.
matplotlib.rcParams['figure.figsize'] = [10,6]
fig, ax = plt.subplots()
ax = sns.violinplot(x="ways_met", y="relationship_duration",data=data)
ax.set_xlabel("Year")
ax.set_ylabel("Relationship Duration")
ax.set_title("Way Couple Met v Relationship Duration")
plt.xticks(rotation=70)
plt.tight_layout()
I wanted to use a violin plot to see how the way couples met relates to their relationship duration. Couples that met online seem to have shorter relationship durations than the other groups, but their values also seem more widely distributed.
There is a tool in pandas, pandas.cut(), that allows you to cut a dataframe into bins of equal size. I did not use it here because I wanted to split the data by decade, and equal-sized bins would not have represented the data the way I wanted. Splitting manually is a bit more work, but it can show part of how our dating patterns as a society have changed.
titles = ["1940s","1950s","1960s","1970s","1980s","1990s", "2000s","2010s"]
timedf1999g = data[data["year_met"] > 1999]
timedf2000s = timedf1999g[timedf1999g["year_met"] < 2010]
timedf2010s = timedf1999g[timedf1999g["year_met"] >= 2010]
timedf1900s = data[data["year_met"] <= 1999]
timedf1990s = timedf1900s[timedf1900s["year_met"] >= 1990]
timedf1940s = timedf1900s[timedf1900s["year_met"] < 1950]
timedf0 = timedf1900s[timedf1900s["year_met"] < 1990]
timedf1980s = timedf0[timedf0["year_met"] >= 1980]
timedf1 = data[data["year_met"] < 1980]
timedf1970s = timedf1[timedf1["year_met"] >= 1970]
timedf2 = data[data["year_met"] < 1970]
timedf1960s = timedf2[timedf2["year_met"] >= 1960]
timedf3 = data[data["year_met"] < 1960]
timedf1950s = timedf3[timedf3["year_met"] >= 1950]
times = [timedf1940s,timedf1950s,timedf1960s,timedf1970s,timedf1980s,timedf1990s,timedf2000s,timedf2010s]
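An alternative to chaining comparisons is to floor year_met to its decade with integer division and group on that, assuming year_met is numeric. A minimal sketch on toy years (values made up for illustration):

```python
import pandas as pd

# toy years standing in for year_met
df = pd.DataFrame({"year_met": [1948, 1955, 1983, 1999, 2004, 2015]})

# floor each year to its decade, e.g. 1983 -> 1980
df["decade"] = (df["year_met"] // 10) * 10

# one sub-frame per decade, analogous to the timedf* frames above
times = {dec: sub for dec, sub in df.groupby("decade")}
```

Each value in times is then a sub-frame for one decade, ready to pass to a plotting function.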
At the top of each graph will be the most popular way couples met in that decade, and at the bottom the least popular. It will be interesting to see how the 'met_online' category moves throughout the decades.
make_graphs(times,ways_met,titles)
Meeting online appears as far back as the 1960s, which makes me wonder whether that is an error, because the internet did not become public until the 1990s. I am assuming either the respondents made an error or were not referring to that year, but I checked, and the years reported do actually fall in the 1960s.
You will notice that the 1990s is when online dating really started to take off, which makes sense, because that is when the internet became public. In the last decade, meeting online became the most popular way to meet someone.
Now we will be grouping the data by the way the couples met and taking the average of the groupings.
waysAvgs = data.groupby("ways_met").mean()
waysAvgs['ways_met']= waysAvgs.index.values
waysAvgs.head()
| CaseID | CASEID_NEW | weight1 | weight1_freqwt | weight2 | weight1a | weight1a_freqwt | weight_combo | weight_combo_freqwt | duration | ... | partner_yrsed | subject_yrsed | relationship_quality_coded | relationship_quality_very_poor | relationship_quality_poor | relationship_quality_fair | relationship_quality_good | relationship_quality_excellent | age_diff | ways_met | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ways_met | |||||||||||||||||||||
| met_bar_or_resturant | 2195.568862 | 2.229272e+06 | 0.941428 | 73749.140625 | 0.829986 | 0.936037 | 73326.804688 | 0.881268 | 61168.957031 | 412.341317 | ... | 13.751515 | 13.610779 | 3.474074 | 0.992593 | 0.007407 | 0.066667 | 0.281481 | 0.0 | 5.736196 | met_bar_or_resturant |
| met_blind_date | 2041.066667 | 2.125214e+06 | 0.879957 | 68933.710938 | 1.343300 | 0.874893 | 68537.070312 | 0.920787 | 63911.867188 | 435.000000 | ... | 14.000000 | 13.533334 | 3.571429 | 0.928571 | 0.071429 | 0.000000 | 0.214286 | 0.0 | 4.533333 | met_blind_date |
| met_buisness_trip | 2771.800000 | 1.993890e+06 | 1.045950 | 81937.250000 | 0.456600 | 1.039925 | 81465.000000 | 0.969020 | 67259.796875 | 325.200000 | ... | 17.000000 | 16.200001 | 3.500000 | 1.000000 | 0.000000 | 0.000000 | 0.500000 | 0.0 | 3.600000 | met_buisness_trip |
| met_church | 2135.442623 | 2.170797e+06 | 0.960168 | 75217.187500 | 1.676617 | 0.954665 | 74786.085938 | 1.050387 | 72907.546875 | 63.877049 | ... | 14.628099 | 14.401639 | 3.615385 | 1.000000 | 0.000000 | 0.038462 | 0.307692 | 0.0 | 3.652893 | met_church |
| met_college | 1962.454545 | 2.181945e+06 | 0.956831 | 74955.726562 | 0.674600 | 0.951346 | 74526.023438 | 0.974039 | 67608.242188 | 325.136364 | ... | 16.329546 | 16.488636 | 3.567901 | 1.000000 | 0.000000 | 0.049383 | 0.333333 | 0.0 | 3.000000 | met_college |
5 rows × 69 columns
I am adjusting the column labels to correspond to what the column actually represents, which is the average.
avgCol = {
"relationship_duration":"avg_relation_dur",
"relationship_quality_coded": "avg_relationship_quality",
"age_when_met":"avg_age_when_met",
"age_diff":"avg_age_diff",
"time_from_met_to_relationship":"avg_time_from_met_to_relationship"
}
waysAvgs = waysAvgs.rename(columns=avgCol, errors="raise")
When taking the average of relationship quality, I am averaging the values I coded from 0 to 4.
regression_plot(waysAvgs,"avg_relationship_quality","avg_relation_dur",
"Average Relationship Quality vs Average relationship Duration in years",
"Average Relationship Quality","Average Relationship Duration (Years) ",True)
regression_plot(waysAvgs,"avg_age_when_met","avg_relation_dur",
"Average Age When Met vs Average relationship Duration",
"Average Age When Respondent Met Partner","Average Relationship Duration (Years)",True)
You can see that met_online is an outlier with the lowest average relationship duration, and couples who met online tend on average to be older. Average relationship duration and the age when the couple met seem to have a negative correlation.
regression_plot(waysAvgs,"avg_time_from_met_to_relationship","avg_relation_dur",
"Average Time Known Before Relationship Started vs Average relationship Duration in years ",
"Average Time Known Before Relationship Started (Years)","Average Relationship Duration (Years)",True)
The graph above also shows met_online as an outlier, but average time known before the relationship started and average relationship duration do not seem to correlate much at all, as the slope of the regression line is almost horizontal.
regression_plot(waysAvgs,"avg_age_diff","avg_relation_dur",
                "Average Age Difference vs Average Relationship Duration in years",
                "Average Age Difference (Years)","Average Relationship Duration (Years)",True)
The graph above shows that the average age difference between partners and average relationship duration seem to be negatively correlated: as we saw when looking at the data as a whole, the greater the age difference, the shorter the relationship duration.
ax = sns.countplot(x="got_married", hue="ways_met", data=data)
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Now I want to look for some patterns. Above is the frequency of people that got married to the partner they are describing in the survey. I notice that people who met online are more likely not to get married than other couples, roughly tied with couples that met in primary or secondary school. I think this is interesting because meeting online is so new.
ax = sns.countplot(x="relationship_quality",hue="ways_met", data=data)
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title("Frequency of Relationship Quality Grouped by the way they Met")
The graph above shows that most couples are happy in their relationship regardless of how they met each other.
ax = sns.countplot(x="how_non_marriage_end", hue="ways_met", data=data)
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title("Ways Non Marriages Ended Grouped by the way they Met")
Of the relationships that did not end in marriage, the groups that broke up the most were couples that met online and couples that met in primary or secondary school.
g = sns.catplot(x="relationship_quality",y='relationship_duration',
data=data, kind="bar", hue="country_met",
ci="sd", palette="dark", alpha=.6, height=6
)
g.despine(left=True)
g.set_axis_labels("Relationship Quality", "Relationship Duration in years")
g.legend.set_title(" ")
I wanted to analyze whether the country a couple met in affects their relationship quality and the length of their relationship. It is interesting to note that people who met in the United States seem to have a more negative view of their relationship than people who met in other countries.
I want to see how, for couples that met online, features like age difference, age when they met, time before they defined the relationship, and relationship quality relate to relationship duration.
met_online = data[data["met_online"] == 1]
met_online.head()
| CaseID | CASEID_NEW | qflag | weight1 | weight1_freqwt | weight2 | weight1a | weight1a_freqwt | weight_combo | weight_combo_freqwt | ... | partner_yrsed | subject_yrsed | relationship_quality_coded | relationship_quality_very_poor | relationship_quality_poor | relationship_quality_fair | relationship_quality_good | relationship_quality_excellent | age_diff | ways_met | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2014039 | Qualified | NaN | NaN | 0.8945 | NaN | NaN | 0.277188 | 19240.0 | ... | 12.0 | 14.0 | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | met_online |
| 2 | 5 | 2145527 | Qualified | 0.7205 | 56442.0 | NaN | 0.7164 | 56121.0 | 0.810074 | 56227.0 | ... | 14.0 | 17.0 | 3.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | met_online |
| 3 | 6 | 2648857 | Qualified | 1.2597 | 98682.0 | 1.3507 | 1.2524 | 98110.0 | 0.418556 | 29052.0 | ... | 12.0 | 12.0 | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | met_online |
| 12 | 16 | 645963 | Qualified | 0.9694 | 75940.0 | NaN | 0.9638 | 75502.0 | 1.089824 | 75645.0 | ... | 16.0 | 16.0 | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | met_online |
| 17 | 22 | 2690979 | Qualified | 0.9338 | 73152.0 | NaN | 0.9284 | 72729.0 | 1.049795 | 72866.0 | ... | 12.0 | 12.0 | 4.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | met_online |
5 rows × 293 columns
Here I am counting how many people met online in each specific way.
ways_int_mtg = ['hcm2017q24_internet_other',
'hcm2017q24_internet_dating',
'hcm2017q24_internet_soc_network',
'hcm2017q24_internet_game',
'hcm2017q24_internet_chat',
'hcm2017q24_internet_org']
met_counter = {}
for way in ways_int_mtg:
met_freq2(met_counter,met_online,way)
ax = sns.barplot(x=[*met_counter.keys()], y=[*met_counter.values()])
ax.set_xlabel("Ways of Meeting Online")
ax.set_ylabel("Number of People")
ax.set_title("Ways Couples Met Online")
plt.xticks(rotation=70)
plt.tight_layout()
As I expected, internet dating is the most popular way people met on the internet, which makes sense with the rise of dating apps like Tinder, Bumble, and others. After internet dating, internet_other follows; I am not sure what that entails, but probably ways of meeting that are less conventional.
ax = sns.countplot(x="relationship_quality", data=met_online)
Most couples that met online view their relationship as excellent and in a positive manner.
g = sns.catplot(x="relationship_quality",y='relationship_duration',
data=met_online, kind="bar", hue="gender_of_respondent",
ci="sd", palette="dark", alpha=.6, height=6
)
g.despine(left=True)
g.set_axis_labels("Relationship Quality", "Relationship Duration in years")
g.legend.set_title(" ")
Looking at the graph above, the respondents that were female tended to see their relationship more negatively than the male respondents did.
g = sns.catplot(x="relationship_quality",y='age_diff',
data=met_online, kind="bar", hue="gender_of_respondent",
ci="sd", palette="dark", alpha=.6, height=6
)
g.despine(left=True)
g.set_axis_labels("Relationship Quality", "Age difference in years")
g.legend.set_title(" ")
g = sns.catplot(x="relationship_quality",y='relationship_duration',
data=met_online, kind="bar", hue="ppagecat",
ci="sd", palette="dark", alpha=.6, height=6
)
g.despine(left=True)
g.set_axis_labels("Relationship Quality", "Relationship Duration in years")
g.legend.set_title(" ")
The age group that sees the relationship most poorly among couples that met online seems to be the 25-34 group (as of 2017), which is interesting because they are the age group that grew up on the internet.
regression_plot(met_online,"age_when_met","relationship_duration",
"Age Respondent First Met Partner vs Relationship Duration",
"Age Respondent First Met Partner(Years)",
"Relationship Duration (Years)",False)
The age when they first met and their relationship duration do not seem to be very correlated.
regression_plot(met_online,"age_diff","relationship_duration",
"Age Difference Between Partners vs Relationship Duration",
"Age Difference Between Partners (Years)",
"Relationship Duration (Years)",False)
The regression line suggests some correlation, but the points cluster around smaller age differences, and those couples seem to have longer relationship durations.
regression_plot(met_online,"time_from_met_to_relationship",
"relationship_duration",
"Time Known Before Relationship Started vs Relationship Duration",
"Length of Time From Meet Time (Years)","Relationship Duration (Years)",False)
If we go back to when we looked at the data as a whole, met_online was an outlier in the regression plots of the averages, but I wonder if that was because of the head start the other ways_met had before the online era. We found that in the last decade, meeting online was the most popular way couples met. I wonder how dating, online and otherwise, has been impacted in the online era.
met_after_2010 = data[data["year_met"] >= 2010]
matplotlib.rcParams['figure.figsize'] = [10,6]
fig, ax = plt.subplots()
ax = sns.violinplot(x="ways_met", y="relationship_duration",data=met_after_2010)
ax.set_xlabel("Year")
ax.set_ylabel("Relationship Duration")
ax.set_title("Way Couple Met v Relationship Duration")
plt.xticks(rotation=70)
plt.tight_layout()
Compared to the violin plot over all the data, the post-2010 data looks more evenly distributed, but met_online still seems to have a shorter relationship duration.
ways2010mean = met_after_2010.groupby("ways_met").mean()
ways2010mean['ways_met']= ways2010mean.index.values
ways2010mean.head()
| CaseID | CASEID_NEW | weight1 | weight1_freqwt | weight2 | weight1a | weight1a_freqwt | weight_combo | weight_combo_freqwt | duration | ... | partner_yrsed | subject_yrsed | relationship_quality_coded | relationship_quality_very_poor | relationship_quality_poor | relationship_quality_fair | relationship_quality_good | relationship_quality_excellent | age_diff | ways_met | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ways_met | |||||||||||||||||||||
| met_bar_or_resturant | 2384.370370 | 2.403002e+06 | 1.008645 | 79014.703125 | 0.901214 | 1.002860 | 78561.648438 | 0.912396 | 63329.554688 | 17.703704 | ... | 13.444445 | 13.351851 | 3.173913 | 1.0 | 0.0 | 0.173913 | 0.304348 | 0.0 | 8.074074 | met_bar_or_resturant |
| met_church | 2461.866667 | 2.103780e+06 | 1.076133 | 84301.468750 | 0.830450 | 1.069960 | 83818.070312 | 1.097548 | 76180.867188 | 21.866667 | ... | 14.266666 | 13.400000 | 3.666667 | 1.0 | 0.0 | 0.000000 | 0.333333 | 0.0 | 4.066667 | met_church |
| met_college | 2217.333333 | 2.439922e+06 | 1.099360 | 86121.000000 | 0.750575 | 1.093060 | 85627.500000 | 0.896396 | 62219.000000 | 60.333333 | ... | 15.333333 | 14.666667 | 3.600000 | 1.0 | 0.0 | 0.100000 | 0.200000 | 0.0 | 1.500000 | met_college |
| met_coworkers | 2102.217949 | 2.352373e+06 | 1.087667 | 85205.046875 | 0.874471 | 1.081427 | 84716.289062 | 1.030886 | 71553.960938 | 168.153846 | ... | 14.269231 | 14.128205 | 3.450000 | 1.0 | 0.0 | 0.150000 | 0.250000 | 0.0 | 5.052632 | met_coworkers |
| met_customer_client | 2045.206897 | 2.135478e+06 | 1.054995 | 82645.710938 | 0.680362 | 1.048943 | 82171.476562 | 0.917060 | 63653.277344 | 198.655172 | ... | 14.103448 | 13.689655 | 3.217391 | 1.0 | 0.0 | 0.086957 | 0.434783 | 0.0 | 8.551724 | met_customer_client |
5 rows × 69 columns
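One caveat with the groupby above: on recent pandas releases, `DataFrame.groupby(...).mean()` raises a `TypeError` if the frame still contains non-numeric columns, instead of silently dropping them as older versions did. A minimal sketch of the safer call, using a toy frame rather than the survey data:

```python
import pandas as pd

# Toy frame standing in for the survey data: a grouping key,
# a numeric column, and a non-numeric column that would trip up .mean().
toy = pd.DataFrame({
    "ways_met": ["met_online", "met_online", "met_church"],
    "relationship_duration": [2.0, 4.0, 10.0],
    "notes": ["a", "b", "c"],
})

# numeric_only=True restricts the aggregation to numeric columns.
means = toy.groupby("ways_met").mean(numeric_only=True)
print(means)
```

With `numeric_only=True` the string column is simply excluded from the result, so the same code runs on both old and new pandas versions.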
regression_plot(ways2010mean,"relationship_quality_coded",
"relationship_duration","Average Relationship Quality vs Average Relationship Duration",
"Relationship Quality","Relationship Duration (Years)",True)
regression_plot(ways2010mean,"time_from_met_to_relationship","relationship_duration",
"Average Time Known Before Relationship Started vs Average Relationship Duration",
"Time From Meeting to Relationship (Years)","Average Relationship Duration (Years)",True)
regression_plot(ways2010mean,"age_diff","relationship_duration","Average Age Difference vs Average Relationship Duration",
"Average Age Difference Between Partners (Years)","Average Relationship Duration (Years)",True)
regression_plot(ways2010mean,"age_when_met","relationship_duration",
"Average Age When Respondent Met Partner vs Average Relationship Duration",
"Average Age When Respondent Met Partner","Average Relationship Duration (Years)",True)
Looking at the graphs, met_online is less of an outlier than before; in fact it seems to behave similarly to the other methods of meeting. The regression lines are generally negatively correlated but not steeply so, the one exception being the Average Relationship Quality vs Average Relationship Duration graph, where an outlier made the regression line steeper than the rest of the points warrant.
This is the portion of the tutorial where I try to build a model that predicts how long a relationship will last based on the features we analyzed. In doing so we will get a sense of how predictive the features I chose actually are.
The function clean_dataset() takes in a dataframe and removes the rows containing NaN or infinite values, so that they do not affect the inputs to the learning model.
def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df has to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)
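As a quick sanity check of the cleaning behavior (on a toy frame, not the survey data, and repeating the function here so the snippet runs standalone): rows containing NaN or ±inf should be dropped and everything left cast to float.

```python
import numpy as np
import pandas as pd

def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df has to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)

# Row 1 has a NaN, row 3 has an inf; both should disappear.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.inf],
                    "b": [1, 2, 3, 4]})
cleaned = clean_dataset(toy)
print(cleaned)  # only rows 0 and 2 survive, both columns as float64
```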
Earlier I mentioned that rather than dropping over 200 columns, I would only grab the columns I need as model inputs and put them into a new dataframe. mini is that smaller version of the big data dataframe.
mini = pd.DataFrame()
mini['age_diff'] = data['age_diff']
mini['age_when_met'] = data['age_when_met']
mini['time_from_met_to_relationship'] = data['time_from_met_to_relationship']
mini['relationship_duration'] = data['relationship_duration']
mini['met_online'] = data['met_online']
#mini['year_met'] = data['year_met']
mini["relationship_quality_very_poor"] = data["relationship_quality_very_poor"]
mini["relationship_quality_poor"] = data["relationship_quality_poor"]
mini["relationship_quality_fair"] = data["relationship_quality_fair"]
mini["relationship_quality_good"] = data["relationship_quality_good"]
mini["relationship_quality_excellent"] = data["relationship_quality_excellent"]
for way in ways_met:
mini[way] = data[way]
mini = clean_dataset(mini)
mini.head()
| age_diff | age_when_met | time_from_met_to_relationship | relationship_duration | met_online | relationship_quality_very_poor | relationship_quality_poor | relationship_quality_fair | relationship_quality_good | relationship_quality_excellent | ... | met_party | met_primary_secondary_school | met_vacation | met_buisness_trip | met_bar_or_resturant | met_in_public | met_volunteer_organization | met_college | met_blind_date | met_in_person_dating_service | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3.0 | 21.0 | 12.250000 | 21.916666 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2.0 | 36.0 | 0.416748 | 11.083333 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 25.0 | 0.083252 | 33.750000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 5 | 1.0 | 23.0 | 0.500000 | 35.083332 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 6 | 1.0 | 15.0 | 0.250000 | 50.500000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 27 columns
Y = mini['relationship_duration']
X = mini.drop('relationship_duration',axis=1)
X.shape, Y.shape
((2682, 26), (2682,))
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.33, random_state=42)
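With test_size=0.33, scikit-learn takes the ceiling of 0.33 × n samples for the test set, so the 2,682 cleaned rows split into 1,796 training rows and 886 test rows. A quick sketch on dummy arrays of the same shape:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy arrays matching the shape of the cleaned survey features.
X = np.zeros((2682, 26))
Y = np.zeros(2682)

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42)

print(X_train.shape, X_test.shape)  # (1796, 26) (886, 26)
```

Fixing random_state makes the split reproducible, so the two models below are compared on exactly the same held-out rows.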
Non-linear SVM (Support Vector Machine) regression: SVM can be extended, via a kernel (here SVR's default RBF kernel), to regression tasks where the samples cannot be fit well by a linear function.
nLinSVM = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2))
#fit on the training split only, so the held-out test set stays unseen
clf1 = nLinSVM.fit(X_train, y_train)
#predicting the target value from the model for the samples
y_test_nl_svm = clf1.predict(X_test)
y_train_nl_svm = clf1.predict(X_train)
#computing the accuracy of the model performance
acc_train_nl_svm = clf1.score(X_train, y_train)
acc_test_nl_svm = clf1.score(X_test, y_test)
#computing root mean squared error (RMSE)
rmse_train_nl_svm = np.sqrt(mean_squared_error(y_train, y_train_nl_svm))
rmse_test_nl_svm = np.sqrt(mean_squared_error(y_test, y_test_nl_svm))
print("Non-Linear SVM: Accuracy on training Data: {:.3f}".format(acc_train_nl_svm))
print("Non-Linear SVM: Accuracy on test Data: {:.3f}".format(acc_test_nl_svm))
print('\nNon-Linear SVM: The RMSE of the training set is:', rmse_train_nl_svm)
print('Non-Linear SVM: The RMSE of the testing set is:', rmse_test_nl_svm)
Non-Linear SVM: Accuracy on training Data: 0.248
Non-Linear SVM: Accuracy on test Data: 0.240
Non-Linear SVM: The RMSE of the training set is: 14.21962459857675
Non-Linear SVM: The RMSE of the testing set is: 14.848016731628269
Linear SVM: SVM can also solve linear regression tasks, when the samples can be fit well by a linear function.
linSVM = make_pipeline(StandardScaler(),
LinearSVR(random_state=0, tol=1e-5))
#again, fit on the training split only
clf2 = linSVM.fit(X_train, y_train)
#predicting the target value from the model for the samples
y_test_l_svm = clf2.predict(X_test)
y_train_l_svm = clf2.predict(X_train)
#computing the accuracy of the model performance
acc_train_l_svm = clf2.score(X_train, y_train)
acc_test_l_svm = clf2.score(X_test, y_test)
#computing root mean squared error (RMSE)
rmse_train_l_svm = np.sqrt(mean_squared_error(y_train, y_train_l_svm))
rmse_test_l_svm = np.sqrt(mean_squared_error(y_test, y_test_l_svm))
print("Linear SVM: Accuracy on training Data: {:.3f}".format(acc_train_l_svm))
print("Linear SVM: Accuracy on test Data: {:.3f}".format(acc_test_l_svm))
print('\nLinear SVM: The RMSE of the training set is:', rmse_train_l_svm)
print('Linear SVM: The RMSE of the testing set is:', rmse_test_l_svm)
Linear SVM: Accuracy on training Data: 0.239
Linear SVM: Accuracy on test Data: 0.240
Linear SVM: The RMSE of the training set is: 14.304678061720681
Linear SVM: The RMSE of the testing set is: 14.846668430371793
The linear SVM matched the non-linear SVM's test score (about 0.240 in both cases) with a nearly identical test RMSE, so it performed at least as well, which might suggest the relationship between these features and duration is roughly linear. I was initially skeptical of how high the score could get, but that turned out to be driven by the year_met column, which is highly correlated with relationship duration; with it excluded (as above) the score is only about 0.24 and the RMSE about 14 years, which is a lot of noise. This leads me to think the other features are not strongly correlated with the duration of the relationship.
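One thing worth flagging in the numbers above: for scikit-learn regressors, .score() returns R² (the coefficient of determination), not classification accuracy, so an "accuracy" of 0.24 really means the model explains about 24% of the variance in relationship duration. A small sketch on synthetic data, showing that score() agrees with r2_score() and how RMSE is computed alongside it:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem, same pipeline shape as the tutorial's model.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
model = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2)).fit(X, y)

preds = model.predict(X)
r2 = model.score(X, y)                        # what the tutorial calls "accuracy"
rmse = np.sqrt(mean_squared_error(y, preds))  # average error, in target units

# score() is exactly the coefficient of determination of the predictions.
assert np.isclose(r2, r2_score(y, preds))
print(f"R^2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

R² is unitless and capped at 1, while RMSE is in the target's units (years here), which is why both are reported: one says how much variance is explained, the other how far off a typical prediction is.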
Based on my analysis, relationships that started online did not last as long when looking at the dataset as a whole. That said, there is some relationship between the way couples met and how long their relationship lasted, along with age difference, the year they met, and their relationship quality, but I would not consider it strong, since the score for the non-linear SVM is only about 0.240 on the test data without the feature year_met. This is not all that informative: we already know that the less time a couple has known each other, the shorter the relationship is likely to be, and the other features are not very helpful for predicting relationship length.
As for online dating, it may not last the longest if you look at the data as a whole, but those who met online tended to rate their relationships as excellent quality. Restricted to the past decade, couples who met online are similar to other couples in relationship quality, and not too far off in duration either. So online dating is not drastically better or worse than the other ways of meeting, at least in the past decade.
Anderson, M., Vogels, E. A., & Turner, E. (2020, October 2). The virtues and downsides of online dating.
Pew Research Center: Internet, Science & Tech. Retrieved December 20, 2021, from
https://www.pewresearch.org/internet/2020/02/06/the-virtues-and-downsides-of-online-dating/
How Couples Meet and Stay Together 2017 (HCMST2017). SSDS Social Science Data Collection. (n.d.).
Retrieved December 20, 2021, from https://data.stanford.edu/hcmst2017#download-data
Pupale, R. (2019, February 11). Support vector machines (SVM): An overview. Medium. Retrieved December 20,
2021, from https://towardsdatascience.com/https-medium-com-pupalerushikesh-svm-f4b42800e989